Synergising single-cell resolution and 4sU labelling boosts inference of transcriptional bursting

Despite the recent rise of RNA-seq datasets combining single-cell (sc) resolution with 4-thiouridine (4sU) labelling, analytical methods exploiting their power to dissect transcriptional bursting are lacking. Here, we present a mathematical model and Bayesian inference implementation to facilitate genome-wide joint parameter estimation and confidence quantification (R package: burstMCMC). We demonstrate that, unlike conventional scRNA-seq, 4sU scRNA-seq resolves temporal parameters and furthermore boosts inference of dimensionless parameters via a synergy between single-cell resolution and 4sU labelling. We apply our method to published 4sU scRNA-seq data and linked with ChIP-seq data, we uncover previously obscured associations between different parameters and histone modifications. Supplementary Information The online version contains supplementary material available at 10.1186/s13059-023-02977-y.

where N is the number of cells in Klein, i refers to the ith cell of Klein and Estimates may be derived from P (α q |λ, y, l k , l q ) with E[α q ] and confidence may be quantified with E[α 2 q ]. Figure S1 indicates high confidence in our estimates through the low CV, while the estimated capture efficiencies for Qiu are lower than for Klein, at around 0.02 on average. A quick, simple method for calculating α q estimates without quantifying confidence is as follows l k,i λ/y i  Figure S1: Density plots of the capture efficiencies measured for the Klein dataset, the total transcript content per cell estimated from Klein, the capture efficiencies estimated for the Qiu dataset and the confidence in those estimates as represented by the CVs.

Conversion rates
Models 2 and 3 also require the gene-specific background and gene-invariant 4sU-mediated T>C conversion rates to be known, λ s and λ n , respectively. As previously mentioned, λ s is defined as the proportion of genomic Ts in all reads and all cells that appeared as Cs in the control dataset for the given gene.
Therefore, the conversion rate observed in the 4sU dataset corresponds to λ s + λ n . The conversion rates of all genes for which we have high confidence in the rate estimate in both datasets are shown in figure S2, which was 6259 genes.
Confidence is obtained by modelling the T>C rate, λ, as classing those with a resulting CV < 10 −0.5 in both datasets as having high confidence. We expect the rate in the 4sU dataset to be at least as large as in the control, hence genes tend to appear on the diagonal or above it. Genes with higher turnover are expected to appear further above the diagonal while those with low turnover are expected to appear closer to it. The curve along the top of the plot represents λ s (x-axis) added to our estimate of λ n . λ n is estimated by first assuming that all reads correspond to new transcripts (synthesised during the pulse) and then calculating for each gene so that p = (p 1 , . . . , p 6259 ) where The estimate for λ n is then the minimum value for which g [p g < 10/6259] < 10. With this approach, λ n ≈ 0.07547. Estimating λ n in this manner relies on an assumption that at least the top ∼ 10 genes in terms of background-

Comparison of gene−specific T>C rates observed with vs without 4sU
Log10 control T>C rate Log10 4sU T>C rate Figure S2: Gene-specific T>C conversion rates in the control vs 4sU datasets for 6259 genes for which we have high confidence in the observed T>C rate in both datasets. The lower red line represents λ s , while the upper one represents λ s + λ n .

Decay rate correlation
Additional confidence in our results is provided by showing the strong correlation between the decay rate estimates we obtained from Qiu for our high confidence genes and those previously calculated in Schofield et al 2018 [3] for the same cell type (figure S3). Log10 decay rate (Schofield) Log10 decay rate (Qiu) Figure S3: Decay rate (δ) estimates we obtained from Qiu for our high confidence gene set vs those calculated by Scofield for the same genes, with the Spearman's rank correlation statistics shown and the diagonal displayed (red line).

Inference on simulated data
In order to assess the performance of the inference algorithm, a dataset was simulated for each of the 12276 genes from the real Qiu dataset whose bursting dynamics were inferred. The θ estimates obtained from Qiu were used as the where α is sampled without replacement from the set of estimated capture efficiencies. Then we have the total UMI count for each cell l = l s + l n and L, where L c is the UMI count of cell c. For each UMI, we draw the number of corresponding reads, r, as where ρ is the maximum likelihood estimate given the observed ratio of reads to UMIs, R, across all cells for the given gene in the real data. ρ is obtained by Now we have y, where y i represents the number of reads with i total conversions in the given cell and Y , where Y c is the vector y for cell c. We may now carry out inference with L and Y as previously described using model 2 (or alternatively model 3, Methods). The same selection of genes that was used for the real data was applied, selecting based on a maximum CV of 0.45 across all parameters.
The correlations between the ground truth values and the parameter estimates

Acceptance rates
Density plots of the MCMC acceptance rates (after burn-in removal) for all 12276 genes analysed in the Qiu dataset and the corresponding simulated genes are shown ( figure S5). This diagnostic indicates that overall the desired mixing behaviour is achieved in all cases for each of the three parameters in our chosen parametrisation, with the ideal acceptance rate being roughly 0.574 [4].

Gene body-localised histone modifications
A similar analysis of the GB-localised HMs was also carried out, where we use H3K36me3 as a representative example, although their metagene profiles and bursting associations are somewhat more diverse than with the four promoterlocalised HMs. H3K4me1 was categeorised as being primarily promoter associated in [5] but we find its connections to transcriptional dynamics instead to be contingent upon its presence throughout the GB, and have therefore reclassified it for this context. The profiles of H3K36me3 halved by the different bursting parameter estimates as before (figure S9) indicate that presence throughout the GB and around the TES seems to be associated with increased µ, a and κ in a uniform manner. No association with b or δ is apparent, suggesting that this HM is associated with increased µ purely through increased κ. In this case, we are able to support the previously reported correlation with a [5], and confirm that the inability of scRNA-seq data to distinguish a and κ did not skew the final conclusions by quantifying the strength (figure S10a) and statistical significance (figure S10b) of H3K36me3 and the other GB-localised HMs which it represents (H3K79me2 and H4K20me1). It should be noted, however, that based on the metagene analysis, while both H3K79me2 (figure S11) and H4K20me1 (figure S12) appear to be primarily associated with increased µ, a and κ, along with H3K4me1 (figure S13), they look to have a positive and negative association with b and δ, respectively, when found throughout the 20%:100% region. However, this is statistically significant only for H3K4me1 (figure S10b), which also has no association with κ and no significant correlation with a. Therefore, for the GB-localised HMs, H3K36me3, H3K79me2 and H4K20me1 can be regarded as similar in their associations with bursting dynamics, while H3K4me1 is an outlier. H3K4me1 is known to be strongly associated with enhancers [6] and, therefore, its presence throughout the portion of the GB downstream of the TSS may signify intronic enhancers, which could enable larger bursts for the gene they are contained within. The regions of association for the GB-localised HMs vary to a degree, as dictated by the metagene analysis, with the values used for the correlation analysis shown in figures S10a and S10b being averaged   For example, the most intense blue indicates that, for the given correlation, 10 −2 < p, meaning no statistical significance, the neutral colour indicates that 10 −4 < p ≤ 10 −3 , while the most intense red indicates that p ≤ 10 −6 .

Additional histone modifications
Analysis of two additional HMs was carried out based on previous work strongly linking them to active enhancer regions [6], which were not analysed in [5]. The metagene for H4K16ac (figure S14) indicates an association with increased burst rate by increased burst frequency, which overcomes an increased decay rate.
Likewise, the increased burst rate overcomes the reduced burst size to result in overall increased expression level. The effects which would reduce gene activity (b and δ) begin at the TSS and appear to grow in strength throughout the GB and towards the TES, eventually neutralising the κ-related positive effect on expression which is uniform throughout the GB and begins upstream of the TSS. On the other hand, the metagene for H3K18ac (figure S15) indicates a positive association with expression level and burst size, along with a possible negative association with decay rate, but no effect related to burst frequency.
We next quantify the strength of the correlations between the HM coverage values and parameter estimates (figure S16a) and their statistical significance (figure S16b), confirming the significance of the positive association of H3K18ac with both expression level and burst size over the -2000:100% region, but not with decay rate. For H4K16ac, we find significant positive associations with expression level, burst rate and burst frequency over the -2000:100% region, and significant positive and negative associations with decay rate and burst size, respectively, over the 0%:+2000 region.   For example, the most intense blue indicates that, for the given correlation, 10 −2 < p, meaning no statistical significance, the neutral colour indicates that 10 −4 < p ≤ 10 −3 , while the most intense red indicates that p ≤ 10 −6 .